3 Vector Data (sf basics)

In this chapter, we will work with the following packages. Before starting any exercises, make sure to run the following code.
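The setup code itself does not appear above. Judging from the functions called later in this chapter (mapview(), ggplot() with geom_sf(), st_join(), and filter() with %>%), a minimal setup might look like the following. The exact package list is an assumption, and the packages are assumed to be installed already.

```r
# Assumed package list, inferred from the functions used in this chapter
library(sf)       # simple features: st_join() and the sf class
library(mapview)  # interactive maps: mapview()
library(ggplot2)  # static maps: ggplot() + geom_sf()
library(dplyr)    # data manipulation: filter() and the %>% pipe
```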

Learning objectives:

  1. Understand the types and structures of vector data (points, lines, and polygons)

  2. Produce a map using vector data

  3. Manipulate vector data

3.1 Vector Data in GIS

In Geographic Information Systems (GIS), vector data is one of the primary ways to represent geographic features on a map. Vector data uses geometric shapes – specifically points, lines, and polygons – to model real-world objects.

Each vector feature can also carry attribute data, which are stored in a table linked to the spatial features. For example, a polygon representing a park might include attributes such as its name, size, or type of vegetation.
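As a minimal sketch of how geometry and attributes sit side by side, the park example can be built by hand with sf. The coordinates, the park name, and the vegetation type below are made up for illustration, and no coordinate reference system is assigned.

```r
library(sf)

# Hypothetical park outline: a closed ring of (x, y) coordinates
# (the first and last rows are identical, which closes the ring)
park_ring <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))

# One feature: two attribute columns plus a geometry column
sf_park <- st_sf(
  name = "Example Park",
  vegetation = "deciduous forest",
  geometry = st_sfc(st_polygon(list(park_ring)))
)

# The attribute table alone, with the geometry dropped
df_park <- st_drop_geometry(sf_park)
print(sf_park)
```

Each row of the object is one feature, so attributes and geometry always travel together when rows are filtered or reordered.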

Vector data is particularly useful for representing clearly defined boundaries and discrete features, and it allows for detailed spatial analysis and accurate map production.

3.2 Point

Points represent discrete locations that have no area or length, such as the location of a weather station, a tree, or a city. Each point has a pair of coordinates (latitude and longitude or x and y) that indicate its position.
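To make the coordinate pair concrete, here is a small sketch that turns a plain table of longitude and latitude values into a point layer. The column names lon and lat are assumptions chosen for this example; the two coordinate pairs are taken from the sample output shown later in this section.

```r
library(sf)

# A plain data frame with one row per site
df_site <- data.frame(
  site_id = c("finsync_nrs_nc-10013", "finsync_nrs_nc-10014"),
  lon = c(-81.51025, -80.35989),
  lat = c(36.11188, 35.87616)
)

# Convert to a point layer; coords is given as x (longitude), then y (latitude)
# crs = 4326 is the EPSG code for WGS 84
sf_pt <- st_as_sf(df_site, coords = c("lon", "lat"), crs = 4326)

# Extract the coordinates back as a matrix with columns X and Y
xy <- st_coordinates(sf_pt)
print(sf_pt)
```

Note that sf stores longitude as X and latitude as Y, so the order in coords matters.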

The sample data used in Chapter 2 is an example of a point vector layer. Let’s take a closer look at this dataset (saved as data/sf_finsync_nc.rds in the shared repository).

sf_site <- readRDS("data/sf_finsync_nc.rds")
print(sf_site)
## Simple feature collection with 122 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.02452 ymin: 34.53497 xmax: -76.74611 ymax: 36.54083
## Geodetic CRS:  WGS 84
## # A tibble: 122 × 2
##    site_id                          geometry
##    <chr>                         <POINT [°]>
##  1 finsync_nrs_nc-10013 (-81.51025 36.11188)
##  2 finsync_nrs_nc-10014 (-80.35989 35.87616)
##  3 finsync_nrs_nc-10020 (-81.74472 35.64379)
##  4 finsync_nrs_nc-10023  (-82.77898 35.6822)
##  5 finsync_nrs_nc-10024 (-77.75384 35.38553)
##  6 finsync_nrs_nc-10027 (-83.69678 35.02467)
##  7 finsync_nrs_nc-10029 (-80.65668 35.16119)
##  8 finsync_nrs_nc-10034 (-82.04497 36.09917)
##  9 finsync_nrs_nc-10041 (-80.46558 35.04365)
## 10 finsync_nrs_nc-10049 (-77.96322 34.58249)
## # ℹ 112 more rows

If you examine the geometry column, you’ll see that it contains pairs of longitude and latitude values (in that order, x then y) under the notation <POINT [°]>, which specify the location of each site. Using this geographic information, we visualized the survey sites on a map in Chapter 2.1. We can map this data with the mapview::mapview() function:

mapview(sf_site,
        col.regions = "black", # point's fill color
        legend = FALSE) # disable legend

3.3 Line

Lines (also called polylines) represent linear features such as roads, rivers, or trails. A line consists of a sequence of connected points and may include curves or bends. Lines have length, but no area.
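The idea of a line as an ordered sequence of points can be sketched with a hand-made example. The coordinates are arbitrary and no coordinate reference system is assigned, so the length is in plain coordinate units.

```r
library(sf)

# Three vertices in drawing order; the middle vertex creates a bend
xy <- rbind(c(0, 0), c(3, 4), c(3, 10))

# Wrap the line in a geometry column (sfc) without a CRS
sfc_line <- st_sfc(st_linestring(xy))

st_geometry_type(sfc_line)

# Planar length: the first segment is 5 units, the second is 6 units
line_length <- st_length(sfc_line)
print(line_length) # 11
```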

Stream lines are an example of line geometries. We will use a sample dataset stored in data/sf_stream.rds, which contains the stream network within Guilford County, NC. You can load and inspect it in R as follows:

sf_str <- readRDS("data/sf_stream.rds")
print(sf_str)
## Simple feature collection with 5012 features and 1 field
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -80.04671 ymin: 35.9001 xmax: -79.53284 ymax: 36.25706
## Geodetic CRS:  WGS 84
## First 10 features:
##                          geometry       fid
## 1  LINESTRING (-79.90748 36.18... fid000001
## 2  LINESTRING (-79.89768 36.20... fid000002
## 3  LINESTRING (-79.94756 36.15... fid000003
## 4  LINESTRING (-79.94152 36.25... fid000004
## 5  LINESTRING (-79.94206 36.20... fid000005
## 6  LINESTRING (-79.90003 36.18... fid000006
## 7  LINESTRING (-79.94737 36.20... fid000007
## 8  LINESTRING (-79.87595 36.18... fid000008
## 9  LINESTRING (-79.88022 36.25... fid000009
## 10 LINESTRING (-79.94859 36.25... fid000010

In contrast to the point vector layer introduced earlier, this dataset’s geometry column contains the notation LINESTRING, indicating that the features represent linear geometries—specifically, stream segments. Let’s visualize this data to better understand its structure:

mapview(sf_str,
        color = "steelblue", # line's color
        legend = FALSE) # disable legend

3.4 Polygon

Polygons represent areas such as lakes, parks, or country boundaries. A polygon is formed by a closed sequence of lines that define its perimeter, allowing it to enclose a space and have both area and shape. As an example, we’ll use county-level polygon data from North Carolina:

sf_nc_county <- readRDS("data/sf_nc_county.rds")
print(sf_nc_county)
## Simple feature collection with 100 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32377 ymin: 33.88212 xmax: -75.45662 ymax: 36.58973
## Geodetic CRS:  WGS 84
## First 10 features:
##         county                       geometry
## 1         ashe MULTIPOLYGON (((-81.47258 3...
## 2    alleghany MULTIPOLYGON (((-81.23971 3...
## 3        surry MULTIPOLYGON (((-80.45614 3...
## 4    currituck MULTIPOLYGON (((-76.00863 3...
## 5  northampton MULTIPOLYGON (((-77.21736 3...
## 6     hertford MULTIPOLYGON (((-76.74474 3...
## 7       camden MULTIPOLYGON (((-76.00863 3...
## 8        gates MULTIPOLYGON (((-76.56218 3...
## 9       warren MULTIPOLYGON (((-78.30849 3...
## 10      stokes MULTIPOLYGON (((-80.02545 3...

In the geometry column, you’ll notice the notation MULTIPOLYGON, which indicates that each feature consists of one or more polygons. The parts need not touch one another; a coastal county with offshore islands, for example, is stored as a single feature. These are classified as polygon vectors in GIS and are commonly used to represent areas with defined boundaries. Let’s visualize these polygons as well:

mapview(sf_nc_county,
        col.regions = "grey", # polygon's fill color
        legend = FALSE) # disable legend
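The difference between POLYGON and MULTIPOLYGON can be sketched with two hand-made rectangles. The coordinates are arbitrary and no coordinate reference system is assigned, so areas are in squared coordinate units.

```r
library(sf)

# Each ring is closed: its first and last rows are identical
ring_a <- rbind(c(0, 0), c(2, 0), c(2, 3), c(0, 3), c(0, 0)) # 2 x 3 rectangle
ring_b <- rbind(c(5, 0), c(6, 0), c(6, 1), c(5, 1), c(5, 0)) # 1 x 1 square, detached

# One MULTIPOLYGON feature made of two separate parts
sfc_mpoly <- st_sfc(st_multipolygon(list(list(ring_a), list(ring_b))))
st_geometry_type(sfc_mpoly)

# Area of the single feature: 6 + 1
mpoly_area <- st_area(sfc_mpoly)
print(mpoly_area) # 7

# Split the feature into its individual POLYGON parts
sfc_parts <- st_cast(sfc_mpoly, "POLYGON")
print(length(sfc_parts)) # 2
```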

3.5 Mapping in ggplot

To visualize the three types of vector data—points, lines, and polygons—together, we can use the ggplot2 package with its geom_sf() function, which natively supports spatial features.

We’ll start by plotting just the polygon layer to show the county boundaries:

ggplot() +
  geom_sf(data = sf_nc_county)

Next, we add the line layer, which represents stream networks, on top of the county polygons:

ggplot() +
  geom_sf(data = sf_nc_county) +
  geom_sf(data = sf_str)

Finally, we add the point layer to the map, which marks the survey sites. This completes the map by showing all three vector types together:

ggplot() +
  geom_sf(data = sf_nc_county) +
  geom_sf(data = sf_str) +
  geom_sf(data = sf_site)
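The same layering can be tried without the sample files. The sketch below uses three hand-made layers with made-up coordinates; the object names are hypothetical. Layers are drawn in the order they are added, so later layers sit on top of earlier ones. The layers here share one CRS, which the sketch checks before plotting.

```r
library(sf)
library(ggplot2)

# Hypothetical layers, all in WGS 84 (EPSG:4326)
sf_poly_toy <- st_sf(geometry = st_sfc(
  st_polygon(list(rbind(c(-80.0, 35.9), c(-79.5, 35.9), c(-79.5, 36.3),
                        c(-80.0, 36.3), c(-80.0, 35.9)))),
  crs = 4326
))
sf_line_toy <- st_sf(geometry = st_sfc(
  st_linestring(rbind(c(-79.9, 36.0), c(-79.8, 36.1), c(-79.6, 36.2))),
  crs = 4326
))
sf_pt_toy <- st_sf(geometry = st_sfc(st_point(c(-79.7, 36.0)), crs = 4326))

# Check that two layers share a CRS
same_crs <- st_crs(sf_poly_toy) == st_crs(sf_line_toy)

# Polygon at the bottom, line in the middle, point on top
p <- ggplot() +
  geom_sf(data = sf_poly_toy) +
  geom_sf(data = sf_line_toy) +
  geom_sf(data = sf_pt_toy)
```

Printing p draws the map.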

3.6 Spatial Data Manipulation

While the map we created earlier provides a good overview, it may appear odd because the stream network is only available for Guilford County, yet the other layers (such as survey sites and county boundaries) span the entire state. To better align the spatial representation, we might want to focus on Guilford County across all layers.

One of the major benefits of using R for GIS analysis is that spatial features can be manipulated just like regular data frames. This kind of spatial subsetting is also possible in platforms like ArcGIS or QGIS, but often involves tedious click-and-save workflows, which I personally try to avoid.

To narrow down the survey sites to only those that fall within Guilford County, we can use the st_join() function from the sf package. This function performs a spatial join, attaching attributes from one layer (e.g., counties) to another (e.g., point locations) based on their geographic overlap. By default, features are matched when their geometries intersect.

Here, we overlay sf_site (survey sites) with sf_nc_county (county polygons) to attach county information to each point:

sf_site_join <- st_join(x = sf_site, # base layer
                        y = sf_nc_county) # overlaying layer

In the original sf_site data, there was no column identifying the county for each site:

print(sf_site)
## Simple feature collection with 122 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.02452 ymin: 34.53497 xmax: -76.74611 ymax: 36.54083
## Geodetic CRS:  WGS 84
## # A tibble: 122 × 2
##    site_id                          geometry
##    <chr>                         <POINT [°]>
##  1 finsync_nrs_nc-10013 (-81.51025 36.11188)
##  2 finsync_nrs_nc-10014 (-80.35989 35.87616)
##  3 finsync_nrs_nc-10020 (-81.74472 35.64379)
##  4 finsync_nrs_nc-10023  (-82.77898 35.6822)
##  5 finsync_nrs_nc-10024 (-77.75384 35.38553)
##  6 finsync_nrs_nc-10027 (-83.69678 35.02467)
##  7 finsync_nrs_nc-10029 (-80.65668 35.16119)
##  8 finsync_nrs_nc-10034 (-82.04497 36.09917)
##  9 finsync_nrs_nc-10041 (-80.46558 35.04365)
## 10 finsync_nrs_nc-10049 (-77.96322 34.58249)
## # ℹ 112 more rows

After running st_join(), the resulting object sf_site_join now includes additional attributes from the county layer—most notably, a county column:

print(sf_site_join)
## Simple feature collection with 122 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.02452 ymin: 34.53497 xmax: -76.74611 ymax: 36.54083
## Geodetic CRS:  WGS 84
## # A tibble: 122 × 3
##    site_id                          geometry county     
##  * <chr>                         <POINT [°]> <chr>      
##  1 finsync_nrs_nc-10013 (-81.51025 36.11188) wilkes     
##  2 finsync_nrs_nc-10014 (-80.35989 35.87616) davidson   
##  3 finsync_nrs_nc-10020 (-81.74472 35.64379) burke      
##  4 finsync_nrs_nc-10023  (-82.77898 35.6822) buncombe   
##  5 finsync_nrs_nc-10024 (-77.75384 35.38553) greene     
##  6 finsync_nrs_nc-10027 (-83.69678 35.02467) clay       
##  7 finsync_nrs_nc-10029 (-80.65668 35.16119) mecklenburg
##  8 finsync_nrs_nc-10034 (-82.04497 36.09917) avery      
##  9 finsync_nrs_nc-10041 (-80.46558 35.04365) union      
## 10 finsync_nrs_nc-10049 (-77.96322 34.58249) pender     
## # ℹ 112 more rows

This column isn’t randomly assigned; it reflects the actual geographic relationship: each survey site inherits the attributes of the county polygon it falls within, based on spatial coordinates.
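The inheritance rule can be checked on a toy example small enough to verify by eye. Everything below is made up for illustration: two unit squares stand in for counties "a" and "b", and make_square() is a helper defined only for this sketch.

```r
library(sf)

# Hypothetical helper: a unit square whose lower-left corner is (xmin, 0)
make_square <- function(xmin) {
  st_polygon(list(rbind(c(xmin, 0), c(xmin + 1, 0), c(xmin + 1, 1),
                        c(xmin, 1), c(xmin, 0))))
}

# Two adjacent "counties"
sf_county_toy <- st_sf(
  county = c("a", "b"),
  geometry = st_sfc(make_square(0), make_square(1))
)

# Three "sites": one inside each square and one outside both
sf_site_toy <- st_sf(
  site_id = c("s1", "s2", "s3"),
  geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(1.5, 0.5)), st_point(c(5, 5)))
)

# Spatial join: each point picks up the attributes of the polygon it falls in
sf_join_toy <- st_join(x = sf_site_toy, y = sf_county_toy)
print(sf_join_toy$county) # "a" "b" NA
```

The third site lies outside both squares, so it is kept but receives NA. By default st_join() retains every row of x, matched or not.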

Now that each survey site has a county identifier, we can easily subset the points located within Guilford County using familiar tidyverse syntax:

sf_site_guilford <- sf_site_join %>% 
  filter(county == "guilford")

We can also extract just the Guilford County polygon from the full county dataset:

sf_nc_guilford <- sf_nc_county %>% 
  filter(county == "guilford")
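As a sketch of what filter() does to a spatial object, the toy layers below (made-up squares and points, hypothetical names) show that the result keeps its sf class and geometry. For comparison, st_filter() from sf selects the same points directly from the polygon, without an intermediate join. This is an alternative shown for illustration, not the approach used in this chapter.

```r
library(sf)
library(dplyr)

# Toy "counties": two adjacent unit squares
sq_a <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
sq_b <- st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
sf_county_toy <- st_sf(county = c("a", "b"), geometry = st_sfc(sq_a, sq_b))

# Toy "sites": one in each square and one outside both
sf_site_toy <- st_sf(
  site_id = c("s1", "s2", "s3"),
  geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(1.5, 0.5)), st_point(c(5, 5)))
)

# Route 1: join, then filter on the attribute (rows with NA county are dropped)
sf_site_a <- st_join(sf_site_toy, sf_county_toy) %>%
  filter(county == "a")

# Route 2: keep the points that intersect the selected polygon
sf_poly_a <- sf_county_toy %>%
  filter(county == "a")
sf_site_a2 <- st_filter(sf_site_toy, sf_poly_a)
```

Both routes return the single site s1 here. The join route additionally carries the county column along, which is convenient when several counties are of interest.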

With these filtered layers, we can re-create the map – this time focusing exclusively on Guilford County and its associated stream network and survey sites:

ggplot() +
  geom_sf(data = sf_nc_guilford) +
  geom_sf(data = sf_str) +
  geom_sf(data = sf_site_guilford)

If we wish, we can customize the colors of each layer and apply a clean base theme to enhance the appearance:

ggplot() +
  geom_sf(data = sf_nc_guilford) +
  geom_sf(data = sf_str,
          color = "steelblue") +
  geom_sf(data = sf_site_guilford,
          color = "salmon") +
  theme_bw()